If we want to identify groups or clusters using expression data, we often want to validate those groups in some way, such as determining the “correct” number of groups or checking for agreement within clusters.
In this notebook, we’ll use the medulloblastoma data from the OpenPBTA project to demonstrate a technique for cluster validation and look how clusters overlap with sample labels that are available in our metadata.
# Bit o' data wranglin'
library(tidyverse)
# Consensus clustering library
library(ConsensusClusterPlus)
# We'll make a 3D plot with this library
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ComplexHeatmap':
add_heatmap
The following object is masked from 'package:IRanges':
slice
The following object is masked from 'package:S4Vectors':
rename
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Directory with the prepared/lightly cleaned OpenPBTA data
data_dir <- file.path("data", "open-pbta", "processed")
# Create a directory to hold our cluster validation results if it doesn't
# exist yet
results_dir <- "results"
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
}
histologies_file <- file.path(data_dir, "pbta-histologies-stranded-rnaseq.tsv")
rnaseq_file <- file.path(data_dir, "pbta-vst-stranded.tsv")
cc_results_file <- file.path(results_dir, "consensus_clustering_results.RDS")
Let’s read in the sample metadata and get the sample identifiers for medulloblastoma samples.
# Read in metadata TSV
histologies_df <- read_tsv(histologies_file)
Parsed with column specification:
cols(
.default = col_character(),
OS_days = col_double(),
age_last_update_days = col_double()
)
See spec(...) for full column specifications.
# Only pull out sample identifiers (KidsFirst biospecimen identifiers) that
# correspond to medulloblastoma samples
medulloblastoma_samples <- histologies_df %>%
filter(short_histology == "Medulloblastoma") %>%
pull(Kids_First_Biospecimen_ID)
# Read in transformed RNA-seq data
rnaseq_df <- read_tsv(rnaseq_file)
Parsed with column specification:
cols(
.default = col_double(),
gene_id = col_character()
)
See spec(...) for full column specifications.
# For our clustering validation analyses, we want a matrix of only
# medulloblastoma samples where the gene identifiers are rownames rather than
# in the first column
rnaseq_mat <- rnaseq_df %>%
# This makes sure we retain all of the columns with biospecimen IDs that
# correspond to medulloblastoma samples
select(gene_id, all_of(medulloblastoma_samples)) %>%
tibble::column_to_rownames("gene_id") %>%
as.matrix()
The method we’ll use to examine the stability of clusters is called consensus clustering. Consensus clustering aims to finds the “consensus” across multiple runs of the algorithm using a resampling procedure.
We’ll use the package ConsensusClusterPlus that we loaded up top (vignette).
The consensus clustering methodology was first introduced in Monti et al. (2003) Let’s look at some simulated data and results from this publication to see what consensus clustering results look like in an ideal case where we know how many clusters there are!
Consensus clustering outputs something called a consensus index that tells us how often two samples are clustered together over multiple runs of the algorithm (0 = never, 1 = always). The consensus index values can be visualized in a consensus matrix. Here’s a consensus matrix from simulated data with three groups. Because this shows the relationship between samples, rows and columns correspond to samples. A cell will be white if the consensus index = 0 and red if the consensus index = 1.
There are 3 groups or clusters, represented by the 3 red blocks on the diagonal, of samples that always cluster together and never cluster with samples outside of that cluster.
We can also look at the cumulative distribution function (CDF) of the consensus matrix to get an idea of at what number of clusters (k) the CDF is maximized. The delta plot shoes use the relative increase in the CDF between k and k - 1. Notice how there’s a drop at k = 3 in this simulated example.
Next, let’s look at a simulated example where there are 5 groups.
Notice how the consensus matrix at k = 5 looks “cleaner” than the one for k = 4 and there appears to be an increase in the area under the CDF between k = 4 and k = 5.
All figures from Monti et al. (2003).
Consensus clustering is one way to help you determine the number of clusters in your data, but it is not the only methodology available. Check out the Data Novia course Cluster Validation Essentials by Alboukadel Kassambara for a deeper dive.
cc_results <- ConsensusClusterPlus(rnaseq_mat,
maxK = 15,
# Setting this seed is necessary for the
# results to be reproducible
seed = 2020,
innerLinkage = "average",
finalLinkage = "average",
distance = "pearson")
end fraction
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
clustered
Our results are not as clean as the simulated data (surprise, surprise)! We see an increase on the delta plot in several places: k = 5, k = 7, k = 9, and k = 12.
If we look at the tracking plot that tells us how the samples group together and the consensus index plots, we can see 3 main clusters that arise around k = 9 but we could also make an argument for the two groups that show up at k 5 through 8.
We’ll move forward with the k = 9 labels, but we hope you appreciate that determining the number clusters is not an easy problem!
Let’s take a look at the class labels for k = 9.
# table() creates a contingency table of counts
table(cc_results[[9]]$consensusClass)
1 2 3 4 5 6 7 8 9
76 1 25 11 2 2 2 1 1
(Note: the numbering of the clusters is arbitrary here.)
A cluster of a few samples may not be that helpful in reaching our analysis goals!
Let’s save the entirety of the consensus clustering results to file.
# Write consensus clustering results to file
write_rds(cc_results, path = cc_results_file)
And now to extract the samples in the clusters of interest. What we used table() on before is actually a named vector.
cc_cluster_labels <- cc_results[[9]]$consensusClass
head(cc_cluster_labels)
BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT BS_1QXEC43H BS_1TWCV047 BS_2SFTWNVE
1 1 2 3 1 1
There are multiple medulloblastoma molecular subtypes and this classification largely relies on gene expression data. A medulloblastoma subtype classifier, which is an example of supervised machine learning, has been applied to the medulloblastoma samples included in OpenPBTA. How do the subtype labels from this classifier (in the molecular_subtype column of our sample metadata) stack up to the clusters we identified with unsupervised methods?
Let’s first make a data frame that holds the subtype labels. We can use this both to compare our unsupervised clustering results to the subtype labels and for some plotting downstream.
mb_molecular_subtype_df <- histologies_df %>%
filter(short_histology == "Medulloblastoma") %>%
select(Kids_First_Biospecimen_ID, molecular_subtype)
Add the consensus clustering labels.
# Create a data frame that contains the consensus cluster results and join
# it with the data frame of molecular subtype labels
cc_df <- data.frame(cc_cluster_labels) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
Do the consensus clustering results agree with the molecular subtype labels?
table(cc_df$cc_cluster_labels, cc_df$molecular_subtype)
Group3 Group4 SHH WNT
1 13 55 6 2
2 1 0 0 0
3 0 7 18 0
4 1 2 0 8
5 0 0 1 1
6 0 0 2 0
7 0 2 0 0
8 0 0 1 0
9 0 1 0 0
Hm… there’s some agreement with the subtype labels but it’s not perfect. Why might that be? Let’s start by looking at the overview figure for the medulloblastoma classifier.
We used a different measure (VST values vs. FPKM) and used all features.
Let’s see if low-dimensional representations agree with our other unsupervised results. First, let’s find the high variance genes; we’ll perform dimension reduction on those genes only.
# Calculate variance
gene_variance <- matrixStats::rowVars(rnaseq_mat)
# Find the value that we'll use as a threshold to filter the top 5%
variance_threshold <- quantile(gene_variance, 0.95)
# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)
First, let’s use UMAP.
# Set seed for reproducible UMAP results
set.seed(2020)
# umap() expects features (genes) to be columns, so we have to use t()
umap_results <- umap::umap(t(rnaseq_mat[high_variance_index, ]))
The UMAP coordinates are in the layout element of the list returned by umap::umap().
# Make a data frame of the layout results and join with molecular subtype
# data frame
umap_plot_df <- data.frame(umap_results$layout) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
Let’s make a scatter plot and color our samples by the subtype labels.
umap_plot_df %>%
ggplot(aes(x = X1,
y = X2,
color = molecular_subtype)) +
geom_point() +
scale_color_manual(values = palette.colors(palette = "Okabe-Ito")) +
theme_bw() +
xlab("UMAP1") +
ylab("UMAP2")
Now, we’ll briefly show you how to use built-in functions for PCA on any matrix that’s not in a specialized object for some kind of genomic data.
# Like umap(), prcomp() expects the features to be columns
pca_results <- prcomp(t(rnaseq_mat[high_variance_index, ]),
scale = TRUE)
# The principal component representation is returned in x
pca_results$x[1:6, 1:6]
PC1 PC2 PC3 PC4 PC5 PC6
BS_09Z7TC35 -10.305658 -30.486253 21.016270 -1.004493 11.720278 -8.9093995
BS_1AYRM596 -12.936747 4.858999 7.484443 -9.923455 1.800493 14.2377837
BS_1BWP5MCT -3.198626 -37.832010 9.673423 11.902507 7.611624 -11.4215716
BS_1QXEC43H 37.056562 13.450708 7.197838 1.012471 5.170360 -5.4456082
BS_1TWCV047 -17.895880 -13.279184 1.373011 14.065867 6.912049 -0.1576687
BS_2SFTWNVE -15.486259 21.313599 -14.310736 13.851413 -12.197892 3.5925658
Let’s get the first 10 PCs ready for plotting.
pca_plot_df <- data.frame(pca_results$x[, 1:10]) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
And now make a scatterplot of PC1 and PC2.
pca_plot_df %>%
ggplot(aes(x = PC1,
y = PC2,
color = molecular_subtype)) +
geom_point() +
scale_color_manual(values = palette.colors(palette = "Okabe-Ito")) +
theme_bw() +
xlab("PC1") +
ylab("PC2")
What about PC2 and PC3?
pca_plot_df %>%
ggplot(aes(x = PC2,
y = PC3,
color = molecular_subtype)) +
geom_point() +
scale_color_manual(values = palette.colors(palette = "Okabe-Ito")) +
theme_bw() +
xlab("PC2") +
ylab("PC3")
It can be useful to understand the proportion of variance explained by each principal component when visualizing and interpreting the results. For example, if PC1 explained 96% of the variance in your data and very clearly showed a difference between sample batches you would be very concerned! On the other hand, if a separation of batches was apparent in a different principal component that explained a low proportion of variance and the first few PCs explained most of the variance and appeared to correspond to something like tissue type and treatment, you would be less concerned.
summary() will report the proportion of variance explained by each principal component. By accessing the importance element with <summary results>$importance, we can use indexing to only look at the first 10 PCs.
# Save summary of the PCA results
pca_summary <- summary(pca_results)
# Importance information for the first 10 PCs
pca_importance <- pca_summary$importance[, 1:10]
pca_importance
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 21.76396 17.86412 14.80238 12.21540 9.649902 8.56367
Proportion of Variance 0.18819 0.12679 0.08705 0.05928 0.037000 0.02914
Cumulative Proportion 0.18819 0.31498 0.40203 0.46131 0.498310 0.52745
PC7 PC8 PC9 PC10
Standard deviation 7.710388 7.63968 6.859131 6.545061
Proportion of Variance 0.023620 0.02319 0.018690 0.017020
Cumulative Proportion 0.551060 0.57425 0.592950 0.609960
You might consider adding these values to your axis labels to add readers in interpretation. (DESeq2::plotPCA(), which we used in the bulk RNA-seq module, does this for us!)
plotly is a package that allows us to plot interactive 3D scatterplots.
# plot_ly is kind of like ggplot(), but it doesn't require us to specify the
# aesthetics or types (e.g., geoms) separately in the same way
pca_3d_plot <- plot_ly(data = pca_plot_df,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
type = "scatter3d",
color = ~ molecular_subtype,
colors = colorblindr::palette_OkabeIto[1:4])
# Add axis labels
pca_3d_plot <- pca_3d_plot %>%
layout(scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
))
pca_3d_plot
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] plotly_4.9.2.1 ConsensusClusterPlus_1.50.0
[3] ComplexHeatmap_2.2.0 scran_1.14.6
[5] scater_1.14.6 SingleCellExperiment_1.8.0
[7] magrittr_1.5 DESeq2_1.26.0
[9] SummarizedExperiment_1.16.1 DelayedArray_0.12.3
[11] BiocParallel_1.20.1 matrixStats_0.55.0
[13] Biobase_2.46.0 GenomicRanges_1.38.0
[15] GenomeInfoDb_1.22.1 IRanges_2.20.2
[17] S4Vectors_0.24.4 BiocGenerics_0.32.0
[19] tximport_1.14.2 forcats_0.4.0
[21] stringr_1.4.0 dplyr_1.0.0
[23] purrr_0.3.4 tidyr_1.0.0
[25] tibble_3.0.3 ggplot2_3.3.0
[27] tidyverse_1.3.0 readr_1.3.1
loaded via a namespace (and not attached):
[1] circlize_0.4.8 readxl_1.3.1 backports_1.1.8
[4] Hmisc_4.4-0 igraph_1.2.4.2 lazyeval_0.2.2
[7] splines_3.6.1 crosstalk_1.1.0.1 digest_0.6.25
[10] htmltools_0.4.0 viridis_0.5.1 fansi_0.4.1
[13] checkmate_1.9.4 memoise_1.1.0 cluster_2.1.0
[16] limma_3.42.2 annotate_1.64.0 modelr_0.1.5
[19] askpass_1.1 jpeg_0.1-8.1 colorspace_1.4-1
[22] blob_1.2.0 rvest_0.3.5 haven_2.2.0
[25] xfun_0.15 crayon_1.3.4 RCurl_1.95-4.12
[28] jsonlite_1.6.1 genefilter_1.68.0 survival_3.1-12
[31] glue_1.4.1 gtable_0.3.0 zlibbioc_1.32.0
[34] XVector_0.26.0 GetoptLong_0.1.8 BiocSingular_1.2.2
[37] shape_1.4.4 fishpond_1.2.0 scales_1.1.0
[40] pheatmap_1.0.12 DBI_1.1.0 edgeR_3.28.1
[43] Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4
[46] htmlTable_1.13.3 clue_0.3-57 reticulate_1.14
[49] dqrng_0.2.1 foreign_0.8-71 bit_1.1-14
[52] rsvd_1.0.3 Formula_1.2-3 umap_0.2.6.0
[55] htmlwidgets_1.5.1 httr_1.4.1 RColorBrewer_1.1-2
[58] acepack_1.4.1 ellipsis_0.3.1 pkgconfig_2.0.3
[61] XML_3.99-0.3 farver_2.0.3 uwot_0.1.8
[64] nnet_7.3-12 dbplyr_1.4.2 locfit_1.5-9.4
[67] tidyselect_1.1.0 labeling_0.3 rlang_0.4.7
[70] AnnotationDbi_1.48.0 munsell_0.5.0 cellranger_1.1.0
[73] tools_3.6.1 cli_2.0.2 generics_0.0.2
[76] RSQLite_2.2.0 broom_0.5.6 evaluate_0.14
[79] yaml_2.2.1 knitr_1.29 bit64_0.9-7
[82] fs_1.4.1 nlme_3.1-140 xml2_1.3.1
[85] compiler_3.6.1 rstudioapi_0.11 beeswarm_0.2.3
[88] png_0.1-7 reprex_0.3.0 statmod_1.4.34
[91] geneplotter_1.64.0 stringi_1.4.6 RSpectra_0.16-0
[94] lattice_0.20-38 Matrix_1.2-17 vctrs_0.3.2
[97] pillar_1.4.6 lifecycle_0.2.0 colorblindr_0.1.0
[100] GlobalOptions_0.1.1 RcppAnnoy_0.0.16 BiocNeighbors_1.4.2
[103] data.table_1.12.8 bitops_1.0-6 irlba_2.3.3
[106] R6_2.4.1 latticeExtra_0.6-29 gridExtra_2.3
[109] vipor_0.4.5 codetools_0.2-16 gtools_3.8.2
[112] assertthat_0.2.1 openssl_1.4.1 rjson_0.2.20
[115] rprojroot_1.3-2 withr_2.2.0 GenomeInfoDbData_1.2.2
[118] hms_0.5.3 rpart_4.1-15 rmarkdown_2.1
[121] DelayedMatrixStats_1.8.0 Rtsne_0.15 lubridate_1.7.4
[124] base64enc_0.1-3 ggbeeswarm_0.6.0